brazilian portuguese
The Impact of Prosodic Segmentation on Speech Synthesis of Spontaneous Speech
Galdino, Julio Cesar, Leal, Sidney Evaldo, De Souza, Leticia Gabriella, Lima, Rodrigo de Freitas, Moreira, Antonio Nelson Fornari Mendes, Junior, Arnaldo Candido, Oliveira, Miguel Jr., Casanova, Edresson, Aluísio, Sandra M.
Spontaneous speech presents several challenges for speech synthesis, particularly in capturing the natural flow of conversation, including turn-taking, pauses, and disfluencies. Although speech synthesis systems have made significant progress in generating natural and intelligible speech, primarily through architectures that implicitly model prosodic features such as pitch, intensity, and duration, the construction of datasets with explicit prosodic segmentation and their impact on spontaneous speech synthesis remains largely unexplored. This paper evaluates the effects of manual and automatic prosodic segmentation annotations in Brazilian Portuguese on the quality of speech synthesized by a non-autoregressive model, FastSpeech 2. Experimental results show that training with prosodic segmentation produced slightly more intelligible and acoustically natural speech. While automatic segmentation tends to create more regular segments, manual prosodic segmentation introduces greater variability, which contributes to more natural prosody. Analysis of neutral declarative utterances showed that both training approaches reproduced the expected nuclear accent pattern, but the prosodic model aligned more closely with natural pre-nuclear contours. To support reproducibility and future research, all datasets, source codes, and trained models are publicly available under the CC BY-NC-ND 4.0 license.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (5 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers
Färber, Fernanda Bufon, Brito, Iago Alves, Dollis, Julia Soares, Ribeiro, Pedro Schindler Freire Brasil, Sousa, Rafael Teixeira, Filho, Arlindo Rodrigues Galvão
While large language models (LLMs) show transformative potential in healthcare, their development remains focused on high-resource languages, creating a critical barrier for others as simple translation fails to capture unique clinical and cultural nuances, such as endemic diseases. To address this, we introduce MedPT, the first large-scale, real-world corpus for Brazilian Portuguese, comprising 384,095 authentic question-answer pairs from patient-doctor interactions. The dataset underwent a meticulous multi-stage curation protocol, using a hybrid quantitative-qualitative analysis to filter noise and contextually enrich thousands of ambiguous queries. We further augmented the corpus via LLM-driven annotation, classifying questions into seven semantic types to capture user intent. Our analysis reveals its thematic breadth (3,200 topics) and unique linguistic properties, like the natural asymmetry in patient-doctor communication. To validate its utility, we benchmark a medical specialty routing task: fine-tuning a 1.7B parameter model achieves an outstanding 94\% F1-score on a 20-class setup. Furthermore, our qualitative error analysis shows misclassifications are not random but reflect genuine clinical ambiguities (e.g., between comorbid conditions), proving the dataset's deep semantic richness. We publicly release MedPT to foster the development of more equitable, accurate, and culturally-aware medical technologies for the Portuguese-speaking world.
- North America > United States (0.14)
- South America > Brazil > Ceará > Fortaleza (0.04)
- Europe > Switzerland (0.04)
- (3 more...)
Frame Semantic Patterns for Identifying Underreporting of Notifiable Events in Healthcare: The Case of Gender-Based Violence
Dutra, Lívia, Lorenzi, Arthur, Berno, Laís, Campos, Franciany, Biscardi, Karoline, Brown, Kenneth, Viridiano, Marcelo, Belcavello, Frederico, Matos, Ely, Guaranha, Olívia, Santos, Erik, Reinach, Sofia, Torrent, Tiago Timponi
We introduce a methodology for the identification of notifiable events in the domain of healthcare. The methodology harnesses semantic frames to define fine-grained patterns and search them in unstructured data, namely, open-text fields in e-medical records. We apply the methodology to the problem of underreporting of gender-based violence (GBV) in e-medical records produced during patients' visits to primary care units. A total of eight patterns are defined and searched on a corpus of 21 million sentences in Brazilian Portuguese extracted from e-SUS APS. The results are manually evaluated by linguists and the precision of each pattern measured. Our findings reveal that the methodology effectively identifies reports of violence with a precision of 0.726, confirming its robustness. Designed as a transparent, efficient, low-carbon, and language-agnostic pipeline, the approach can be easily adapted to other health surveillance contexts, contributing to the broader, ethical, and explainable use of NLP in public health systems.
- Africa > Togo > Maritime Region > Lome (0.05)
- South America > Brazil > Rio Grande do Sul > Porto Alegre (0.04)
- South America > Brazil > Pernambuco > Recife (0.04)
- (5 more...)
Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe
Feng, Tiantian, Huang, Kevin, Xu, Anfeng, Shi, Xuan, Lertpetchpun, Thanathai, Lee, Jihwan, Lee, Yoonjeong, Byrd, Dani, Narayanan, Shrikanth
Specifically, we report comprehensive benchmark evaluations on dialects and regional language varieties in English, Arabic, Mandarin and Cantonese, Tibetan, Indic languages, Thai, Spanish, French, German, Brazilian Portuguese, and Italian. Our study used over 2 million training utterances from 30 publicly available speech corpora that are provided with dialectal information. We evaluate the performance of several widely used speech foundation models in classifying speech dialects. We assess the robustness of the dialectal models under noisy conditions and present an error analysis that highlights modeling results aligned with geographic continuity. In addition to benchmarking dialect classification, we demonstrate several downstream applications enabled by Voxlect . Specifically, we show that Voxlect can be applied to augment existing speech recognition datasets with dialect information, enabling a more detailed analysis of ASR performance across dialectal variations. Voxlect is also used as a tool to evaluate the performance of speech generation systems.
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- North America > Central America (0.14)
- Europe > United Kingdom > Northern Ireland (0.14)
- (36 more...)
MariNER: A Dataset for Historical Brazilian Portuguese Named Entity Recognition
Sarcinelli, João Lucas Luz Lima, Teixeira, Marina Lages Gonçalves, de Paiva, Jade Bortot, Silva, Diego Furtado
Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that aims to identify and classify entity mentions in texts across different categories. While languages such as English possess a large number of high-quality resources for this task, Brazilian Portuguese still lacks in quantity of gold-standard NER datasets, especially when considering specific domains. Particularly, this paper considers the importance of NER for analyzing historical texts in the context of digital humanities. To address this gap, this work outlines the construction of MariNER: \textit{Mapeamento e Anotações de Registros hIstóricos para NER} (Mapping and Annotation of Historical Records for NER), the first gold-standard dataset for early 20th-century Brazilian Portuguese, with more than 9,000 manually annotated sentences. We also assess and compare the performance of state-of-the-art NER models for the dataset.
- South America > Brazil > São Paulo (0.04)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (5 more...)
UoR-NCL at SemEval-2025 Task 1: Using Generative LLMs and CLIP Models for Multilingual Multimodal Idiomaticity Representation
Markchom, Thanet, Wu, Tong, Huang, Liting, Liang, Huizhi
SemEval-2025 Task 1 focuses on ranking images based on their alignment with a given nominal compound that may carry idiomatic meaning in both English and Brazilian Portuguese. To address this challenge, this work uses generative large language models (LLMs) and multilingual CLIP models to enhance idiomatic compound representations. LLMs generate idiomatic meanings for potentially idiomatic compounds, enriching their semantic interpretation. These meanings are then encoded using multilingual CLIP models, serving as representations for image ranking. Contrastive learning and data augmentation techniques are applied to fine-tune these embeddings for improved performance. Experimental results show that multimodal representations extracted through this method outperformed those based solely on the original nominal compounds. The fine-tuning approach shows promising outcomes but is less effective than using embeddings without fine-tuning. The source code used in this paper is available at https://github.com/tongwu17/SemEval-2025-Task1-UoR-NCL.
- Europe > Austria > Vienna (0.14)
- South America > Colombia > Meta Department > Villavicencio (0.05)
- Europe > United Kingdom > England > Tyne and Wear > Newcastle (0.04)
- (4 more...)
Tradutor: Building a Variety Specific Translation Model
Sousa, Hugo, Almasian, Satya, Campos, Ricardo, Jorge, Alípio
Language models have become foundational to many widely used systems. However, these seemingly advantageous models are double-edged swords. While they excel in tasks related to resource-rich languages like English, they often lose the fine nuances of language forms, dialects, and varieties that are inherent to languages spoken in multiple regions of the world. Languages like European Portuguese are neglected in favor of their more popular counterpart, Brazilian Portuguese, leading to suboptimal performance in various linguistic tasks. To address this gap, we introduce the first open-source translation model specifically tailored for European Portuguese, along with a novel dataset specifically designed for this task. Results from automatic evaluations on two benchmark datasets demonstrate that our best model surpasses existing open-source translation systems for Portuguese and approaches the performance of industry-leading closed-source systems for European Portuguese. By making our dataset, models, and code publicly available, we aim to support and encourage further research, fostering advancements in the representation of underrepresented language varieties.
- Europe > Austria > Vienna (0.14)
- Europe > Switzerland (0.04)
- Europe > Portugal > Porto > Porto (0.04)
- (13 more...)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Performance in a dialectal profiling task of LLMs for varieties of Brazilian Portuguese
Freitag, Raquel Meister Ko, de Gois, Túlio Sousa
Advances in generative AI have enabled near-human responses, crucial for overcoming the Turing test Danziger [2018]. However, achieving this requires algorithms to replicate ethically questionable human behaviors, including biases learned by large language models (LLMs) Freitag [2021]. Biases can be explicit, consciously manipulated, or implicit, operating unconsciously through automatic associations. These biases affect generative AI in two key areas: the rules and filters applied during LLM fine-tuning, and the linguistic datasets used for training. However, the specifics of these biases--whether in rules, filters, or dataset selection--remain unclear Bender et al. [2021].
- South America > Brazil > São Paulo (0.06)
- South America > Brazil > Pernambuco (0.06)
- South America > Brazil > Minas Gerais (0.05)
- (3 more...)
Adapting LLMs for the Medical Domain in Portuguese: A Study on Fine-Tuning and Model Evaluation
Paiola, Pedro Henrique, Garcia, Gabriel Lino, Manesco, João Renato Ribeiro, Roder, Mateus, Rodrigues, Douglas, Papa, João Paulo
This study evaluates the performance of large language models (LLMs) as medical agents in Portuguese, aiming to develop a reliable and relevant virtual assistant for healthcare professionals. The HealthCareMagic-100k-en and MedQuAD datasets, translated from English using GPT-3.5, were used to fine-tune the ChatBode-7B model using the PEFT-QLoRA method. The InternLM2 model, with initial training on medical data, presented the best overall performance, with high precision and adequacy in metrics such as accuracy, completeness and safety. However, DrBode models, derived from ChatBode, exhibited a phenomenon of catastrophic forgetting of acquired medical knowledge. Despite this, these models performed frequently or even better in aspects such as grammaticality and coherence. A significant challenge was low inter-rater agreement, highlighting the need for more robust assessment protocols. This work paves the way for future research, such as evaluating multilingual models specific to the medical field, improving the quality of training data, and developing more consistent evaluation methodologies for the medical field.
- South America > Brazil > São Paulo (0.04)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.47)
Discriminant audio properties in deep learning based respiratory insufficiency detection in Brazilian Portuguese
Gauy, Marcelo Matheus, Berti, Larissa Cristina, Cândido, Arnaldo Jr, Neto, Augusto Camargo, Goldman, Alfredo, Levin, Anna Sara Shafferman, Martins, Marcus, de Medeiros, Beatriz Raposo, Queiroz, Marcelo, Sabino, Ester Cerdeira, Svartman, Flaviane Romani Fernandes, Finger, Marcelo
This work investigates Artificial Intelligence (AI) systems that detect respiratory insufficiency (RI) by analyzing speech audios, thus treating speech as a RI biomarker. Previous works [2,6] collected RI data (P1) from COVID-19 patients during the first phase of the pandemic and trained modern AI models, such as CNNs and Transformers, which achieved 96.5% accuracy, showing the feasibility of RI detection via AI. Here, we collect RI patient data (P2) with several causes besides COVID-19, aiming at extending AI-based RI detection. We also collected control data from hospital patients without RI. We show that the considered models, when trained on P1, do not generalize to P2, indicating that COVID-19 RI has features that may not be found in all RI types.